The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their answers differ; ties are discarded. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one observed. Hover over each entry to display the information used to compute the p-value.
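Concretely, this amounts to an exact two-sided binomial (sign) test on the examples where the two models disagree. Here is a minimal sketch; the function name and the counts in the example are illustrative, not taken from the table:

```python
from math import comb

def sign_test_pvalue(wins_a: int, wins_b: int) -> float:
    """Exact two-sided binomial (sign) test on the disagreements.

    Null hypothesis: whenever models A and B differ, each wins with
    probability 1/2; examples where they tie are discarded.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # the models never disagree: no evidence either way
    # Probability of each possible win count under Binomial(n, 1/2).
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    observed = pmf[wins_a]
    # Two-sided p-value: total probability of outcomes no more likely
    # than the one observed (small tolerance guards against float error).
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))

# Illustrative counts: A beats B on 30 problems and loses 18.
print(sign_test_pvalue(30, 18))  # ~0.11
```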
We can also examine how the p-value typically varies with the difference in accuracy. Hover over a point to display the model pair it represents.
Following Chatbot Arena, these are the head-to-head comparisons between all pairs of models, reporting wins and two types of ties.
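Assuming each model's results reduce to a pass/fail verdict per problem, and reading the two tie types as both-pass and both-fail (an assumption; the data used here is hypothetical), the tallies can be computed like this:

```python
from itertools import combinations

# Hypothetical per-problem pass/fail vectors (True = solved).
results = {
    "model_a": [True,  True,  False, True,  False],
    "model_b": [True,  False, False, True,  True],
    "model_c": [False, True,  False, True,  False],
}

def head_to_head(res_a, res_b):
    """Count wins for each side and both kinds of ties."""
    wins_a = sum(a and not b for a, b in zip(res_a, res_b))
    wins_b = sum(b and not a for a, b in zip(res_a, res_b))
    both_pass = sum(a and b for a, b in zip(res_a, res_b))
    both_fail = sum(not a and not b for a, b in zip(res_a, res_b))
    return wins_a, wins_b, both_pass, both_fail

for name_a, name_b in combinations(results, 2):
    print(name_a, "vs", name_b, head_to_head(results[name_a], results[name_b]))
```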
We show three methods currently used for evaluating code models: raw accuracy (pass@1) as reported by benchmarks, average win rate over all other models, and Elo (technically Bradley-Terry coefficients, following Chatbot Arena); a sketch of the latter two computations follows the table. These usually have near-perfect correlation.
| # | model | pass@1 | win rate | Elo |
|---|---|---|---|---|
| 0 | opencodeinterpreter-ds-33b | 0.738 | 0.777 | 1248.264 |
| 1 | meta-llama-3-70b-instruct | 0.720 | 0.754 | 1226.816 |
| 2 | mixtral-8x22b-instruct-v0.1 | 0.720 | 0.743 | 1213.711 |
| 3 | HuggingFaceH4--starchat2-15b-v0.1 | 0.713 | 0.743 | 1214.953 |
| 4 | deepseek-coder-7b-instruct-v1.5 | 0.713 | 0.742 | 1213.325 |
| 5 | opencodeinterpreter-ds-6.7b | 0.701 | 0.715 | 1186.565 |
| 6 | xwincoder-34b | 0.695 | 0.706 | 1177.236 |
| 7 | speechless-coder-ds-6.7b | 0.659 | 0.653 | 1131.674 |
| 8 | code-llama-70b-instruct | 0.659 | 0.647 | 1125.044 |
| 9 | white-rabbit-neo-33b-v1 | 0.659 | 0.646 | 1124.281 |
| 10 | speechless-starcoder2-15b | 0.628 | 0.597 | 1084.457 |
| 11 | bigcode--starcoder2-15b-instruct-v0.1 | 0.604 | 0.557 | 1048.541 |
| 12 | microsoft--Phi-3-mini-4k-instruct | 0.591 | 0.548 | 1042.756 |
| 13 | Qwen--Qwen1.5-72B-Chat | 0.591 | 0.539 | 1035.047 |
| 14 | code-13b | 0.524 | 0.442 | 955.455 |
| 15 | speechless-starcoder2-7b | 0.518 | 0.427 | 942.880 |
| 16 | codegemma-7b-it | 0.518 | 0.420 | 938.507 |
| 17 | speechless-coding-7b-16k-tora | 0.506 | 0.409 | 927.939 |
| 18 | code-33b | 0.494 | 0.392 | 912.610 |
| 19 | open-hermes-2.5-code-290k-13b | 0.488 | 0.382 | 903.425 |
| 20 | starcoder2-15b-oci | 0.433 | 0.307 | 836.395 |
| 21 | codegemma-7b | 0.415 | 0.306 | 834.815 |
| 22 | mixtral-8x7b-instruct | 0.396 | 0.266 | 796.939 |
| 23 | mistralai--Mistral-7B-Instruct-v0.2 | 0.360 | 0.224 | 753.146 |
| 24 | gemma-1.1-7b-it | 0.354 | 0.203 | 727.874 |
| 25 | octocoder | 0.329 | 0.195 | 721.395 |
| 26 | python-code-13b | 0.305 | 0.160 | 675.949 |
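The win-rate and Elo columns can be reproduced from the pairwise results alone. Below is a minimal sketch, assuming a `wins` matrix of decisive head-to-head wins; how ties are folded in (here, the common convention of half a win per side), and the 400 / base-10 / 1000-anchor Elo scaling, follow Chatbot Arena's usual conventions rather than details confirmed in this post:

```python
import numpy as np

def average_win_rate(wins: np.ndarray) -> np.ndarray:
    """Average win rate of each model over all other models.

    wins[i, j] = head-to-head wins of model i over model j
    (e.g. with each tie counted as half a win for both sides).
    """
    games = wins + wins.T
    with np.errstate(invalid="ignore", divide="ignore"):
        per_opponent = wins / games        # NaN on the diagonal (0/0)
    return np.nanmean(per_opponent, axis=1)

def bradley_terry_elo(wins: np.ndarray, scale=400.0, base=10.0,
                      anchor=1000.0, iters=2000) -> np.ndarray:
    """Fit Bradley-Terry strengths (MM updates, Hunter 2004) and map
    them onto an Elo-like scale centred at `anchor`.

    Assumes every model wins at least one game.
    """
    n = wins.shape[0]
    games = wins + wins.T
    p = np.ones(n)                         # strength parameters
    for _ in range(iters):
        denom = games / (p[:, None] + p[None, :])
        p_new = wins.sum(axis=1) / denom.sum(axis=1)
        p_new /= p_new.mean()              # fix the scale (identifiability)
        if np.allclose(p, p_new):
            break
        p = p_new
    elo = scale * np.log(p) / np.log(base)
    return elo - elo.mean() + anchor

# Toy 3-model example: model 0 dominates, model 2 loses most.
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]], dtype=float)
print(average_win_rate(wins))    # [0.8, 0.45, 0.25]
print(bradley_terry_elo(wins))   # Elo-like scores, highest for model 0
```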